Welcome to the year 2912, where our data science skills are needed to solve a cosmic mystery. We have received a transmission from four light-years away, and things aren't looking good. The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars. While heading towards its first destination, the scorching 55 Cancri E, the unsuspecting Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Unfortunately, it met a fate similar to its namesake from 1000 years before. Although the ship remained intact, almost half of the passengers were transported to an alternate dimension!

With this challenge, we want to cluster the data in order to answer the question "what sorts of people were more likely to survive?" using passenger data (i.e. name, age, gender, socio-economic class, etc.). We will analyze the data in a Python notebook, creating clustering models and examining the received data to better understand the events that led to this cosmic situation and contribute to unraveling the mystery behind this space collision. The goal is to determine whether each passenger was transported to another dimension or not.
We are going to develop the report in 3 chapters:
1) In the first chapter we will focus on the exploration of the dataset, the distribution of the different features and their relations.
2) In the second one we will engage in feature engineering, the filling of missing values and data preprocessing.
3) In the third chapter we will perform clustering and validate it with internal and external indexes, interpreting the final results.
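As a preview of the validation planned for the third chapter, here is a minimal hedged sketch of one internal index (silhouette score, needing no ground truth) and one external index (adjusted Rand index, comparing against known labels). It uses scikit-learn on synthetic blobs rather than the real dataset; the data and parameters below are illustrative, not the notebook's.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(0)
# two well-separated synthetic blobs as stand-in data
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
true_labels = np.array([0] * 50 + [1] * 50)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# internal index: cohesion vs separation, in [-1, 1], no labels needed
sil = silhouette_score(X, pred)
# external index: agreement with known labels, 1.0 = perfect match
ari = adjusted_rand_score(true_labels, pred)
print(sil, ari)
```

On such clearly separated blobs both indexes should be high; on the real passenger data we expect far less clear-cut values.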
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
import seaborn as sns
```

Let's begin with the exploration of the datasets.
```python
# Save to df
data = pd.read_csv('/kaggle/input/spaceship-titanic/train.csv')
label = data['Transported']
# Shape
print('Data shape:', data.shape)
```
```
Data shape: (8693, 14)
```
```python
data.head()
```
|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
The dataset is formed by 8693 entries, each one representing a passenger, and 14 different features, some of which are categorical and some numerical. Their interpretation is the following:
- PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
- CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- Destination - The planet the passenger will be debarking to.
- Age - The age of the passenger.
- VIP - Whether the passenger has paid for special VIP service during the voyage.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- Name - The first and last names of the passenger.
- Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
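As an aside, the gggg_pp and deck/num/side formats described above can also be unpacked with the pandas .str accessor; this is a minimal sketch on a single toy row, not the notebook's approach (which loops over the strings explicitly later).

```python
import pandas as pd

# toy frame with one passenger, values taken from the table above
df = pd.DataFrame({'PassengerId': ['0003_02'], 'Cabin': ['A/0/S']})

# PassengerId is gggg_pp: group number, then member number within the group
pid = df['PassengerId'].str.split('_', expand=True)
df['Group'] = pid[0].astype(int)
df['Member'] = pid[1].astype(int)

# Cabin is deck/num/side (Num stays a string here unless cast)
cab = df['Cabin'].str.split('/', expand=True)
df['Deck'], df['Num'], df['Side'] = cab[0], cab[1], cab[2]

print(df.loc[0, ['Group', 'Member', 'Deck', 'Num', 'Side']].tolist())
# → [3, 2, 'A', '0', 'S']
```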
Since we want to perform unsupervised learning we do not want the target label, so we drop it. The column 'Name', being a categorical feature specific to every entry, could hardly be useful in any way during the process, so we drop it as well.
```python
# remove Transported
data = data.drop(['Transported'], axis=1)
# remove Name
data = data.drop(['Name'], axis=1)
```

As a first step, for a better understanding, we divide the features into numerical and categorical.
```python
num_features = []
cat_features = []
for i, f in enumerate(data.keys()):
    if data.dtypes[i] == 'float64':
        num_features.append(f)
    else:
        cat_features.append(f)
```

Here we create 2 lists with the names of the numerical and categorical features.
```python
num_features
```
```
['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
```
```python
cat_features
```
```
['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP']
```
```python
data.describe()
```
|   | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 8514.000000 | 8512.000000 | 8510.000000 | 8485.000000 | 8510.000000 | 8505.000000 |
| mean | 28.827930 | 224.687617 | 458.077203 | 173.729169 | 311.138778 | 304.854791 |
| std | 14.489021 | 666.717663 | 1611.489240 | 604.696458 | 1136.705535 | 1145.717189 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 19.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 27.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 38.000000 | 47.000000 | 76.000000 | 27.000000 | 59.000000 | 46.000000 |
| max | 79.000000 | 14327.000000 | 29813.000000 | 23492.000000 | 22408.000000 | 24133.000000 |
```python
data[num_features][:5]
```
|   | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| 0 | 39.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 24.0 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 |
| 2 | 58.0 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 |
| 3 | 33.0 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 |
| 4 | 16.0 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 |
```python
plt.figure(figsize=(7,3))
sns.histplot(data=data, x='Age', binwidth=1, kde=True, color='mediumpurple')
```

```python
plt.figure(figsize=(7,3))
sns.histplot(data=data, x='RoomService', binwidth=400, kde=True, color='mediumpurple')
```

```python
plt.figure(figsize=(7,3))
sns.histplot(data=data, x='FoodCourt', binwidth=400, kde=True, color='mediumpurple')
```

```python
plt.figure(figsize=(7,3))
sns.histplot(data=data, x='ShoppingMall', binwidth=400, kde=True, color='mediumpurple')
```

```python
plt.figure(figsize=(7,3))
sns.histplot(data=data, x='Spa', binwidth=400, kde=True, color='mediumpurple')
```

```python
plt.figure(figsize=(7,3))
sns.histplot(data=data, x='VRDeck', binwidth=400, kde=True, color='mediumpurple')
```
As we can clearly see, the 5 "services" features have the same distribution: a small number of people spent a lot, while the majority spent zero. Suspecting that these five features could be related, and that the same people who spend on one service also spend on the others, we would be tempted to construct a single "expenses" feature. Let's analyze their combined distribution and the correlation between them.
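For concreteness, the single "expenses" feature being considered would just be a row-wise sum over the five service columns. A minimal sketch on a toy row (values taken from passenger 1 in the table above), not the real dataset:

```python
import pandas as pd

services_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
# one toy passenger row with the same column names as the dataset
toy = pd.DataFrame([[109.0, 9.0, 25.0, 549.0, 44.0]], columns=services_cols)

expenses = toy[services_cols].sum(axis=1)  # one total per passenger
print(expenses.iloc[0])  # → 736.0
```

Whether this collapse is justified is exactly what the pairplot and correlation matrix below are meant to decide.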
```python
services_features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
services = data[services_features]
services.head()
```
|   | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 |
| 2 | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 |
| 3 | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 |
| 4 | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 |
```python
features = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
# Set the desired height and aspect ratio
height = 1
aspect = 1.2
sns.pairplot(services, vars=features, plot_kws={'s': 6, 'color': 'mediumpurple'},
             diag_kws={'color': 'mediumpurple'}, corner=True, height=height, aspect=aspect)
plt.show()

corrmat = services.corr()
f, ax = plt.subplots(figsize=(7, 5))
sns.heatmap(corrmat, vmax=.8, square=True);
```

From the plot of the combined distribution between pairs of variables we notice the lack of correlation. This is indeed confirmed by the correlation matrix, which shows that the highest value is around 0.35. We conclude that the construction of a single feature would lose information, and the expense features must be kept separate.
We now focus on the categorical features, starting with the ones that do not require special attention for our purposes.
```python
fig = plt.figure(figsize=(7,4))
sns.countplot(data=data, x='HomePlanet', palette='flare')
```

Earth is the most common planet to depart from.
```python
fig = plt.figure(figsize=(6,5))
sns.countplot(data=data, x='CryoSleep', palette='flare')
```

```python
fig = plt.figure(figsize=(6,5))
sns.countplot(data=data, x='Destination', palette='flare')
```

TRAPPIST-1e is by far the most common destination.
```python
fig = plt.figure(figsize=(5,4))
sns.countplot(data=data, x='VIP', palette='flare')
```

We see that only a small percentage of passengers have a VIP card.
Let's now focus on the two most elaborate categorical features: PassengerId and Cabin.
```python
data['PassengerId'][:5]
```
```
0    0001_01
1    0002_01
2    0003_01
3    0003_02
4    0004_01
Name: PassengerId, dtype: object
```
We know that the PassengerId is composed of 6 digits, the first 4 labeling the group and the last two indicating the person's number within the group. So we create two new features, 'Group' and 'Member', and plot their distributions.
```python
group, member = [], []
for pID in data['PassengerId']:
    group.append(int(pID[0:4]))
    member.append(int(pID[5:7]))

plt.figure(figsize=(15,4))
fres, bins = np.histogram(member, np.linspace(0.5, 8.5, 9))
bins = [1, 2, 3, 4, 5, 6, 7, 8]
plt.subplot(1,2,1)
plt.title('Member distribution')
plt.bar(bins, fres, color='mediumpurple')
plt.xlabel('Member')
plt.ylabel('Frequency')
plt.subplot(1,2,2)
plt.title('Group distribution')
unique, counts = np.unique(group, return_counts=True)
plt.bar(unique, counts, color='mediumpurple')
plt.xlabel('Group')
plt.ylabel('# Components')
```
From these plots we see that there are ~9000 groups, most of them composed of a single member. The maximum number of members in a group is 8, and the member distribution follows a negative exponential.
We now drop 'PassengerId' and add the two new features.
```python
# now split the feature PassengerId in two features, group and member
data['Group'] = group
data['Member'] = member
data.head()
```
|   | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Group | Member |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 2 | 1 |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 3 | 1 |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | 3 | 2 |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | 4 | 1 |
```python
passengerID = []
if 'PassengerId' in data.keys():
    # here we save the passenger id
    passengerID = data['PassengerId']
    data = data.drop(['PassengerId'], axis=1)

data['Cabin'][:5]
```
```
0    B/0/P
1    F/0/S
2    A/0/S
3    A/0/S
4    F/1/S
Name: Cabin, dtype: object
```
Following the same logic, from 'Cabin' we build three different features: 'Deck' (deck of the cabin), 'Num' (number of the cabin within the deck) and 'Side' (which can be 'S' or 'P').
```python
cabin_dict = {'Deck': [], 'Num': [], 'Side': []}
for c in data["Cabin"]:
    # a missing Cabin comes through as a float (NaN)
    if type(c) == float:
        cabin_dict['Deck'].append(np.nan)
        cabin_dict['Num'].append(np.nan)
        cabin_dict['Side'].append(np.nan)
        continue
    # split c by /
    c = c.split('/')
    for i, let in enumerate('ABCDEFGT'):
        if c[0] == let:
            cabin_dict['Deck'].append(let)
            cabin_dict['Num'].append(int(c[1]))
    for i, let in enumerate('SP'):
        if c[2] == let:
            cabin_dict['Side'].append(let)

# cabin_dict into dataframe
cabin_df = pd.DataFrame(cabin_dict)
cabin_df[:5]
```
|   | Deck | Num | Side |
|---|---|---|---|
| 0 | B | 0.0 | P |
| 1 | F | 0.0 | S |
| 2 | A | 0.0 | S |
| 3 | A | 0.0 | S |
| 4 | F | 1.0 | S |
```python
num_cabin_df = {}
# Map non-numeric data to numeric values
num_cabin_df['Deck'] = cabin_df['Deck'].map({'A': 0.5, 'B': 1.5, 'C': 2.5, 'D': 3.5,
                                             'E': 4.5, 'F': 5.5, 'G': 6.5, 'T': 7.5})
num_cabin_df['Side'] = cabin_df['Side'].map({'S': -1, 'P': 1})
num_cabin_df['Num'] = cabin_df['Num']

# plot the distribution of the features using subplots(1,3)
fig, ax = plt.subplots(1, 3, figsize=(15,5))
# plot frequency of the features Deck, Num and Side
sns.histplot(data=num_cabin_df, x='Deck', binwidth=1, color='mediumpurple', ax=ax[0])
sns.histplot(data=num_cabin_df, x='Num', binwidth=1, color='mediumpurple', ax=ax[1])
sns.histplot(data=num_cabin_df, x='Side', binwidth=1, color='mediumpurple', ax=ax[2])
plt.show()
```

Now let's produce some scatter plots to see the relations between the new features. We will use seaborn's stripplot in order to avoid overlapping and to understand the density of the points.
```python
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
sns.stripplot(data=cabin_df, x="Side", y="Num", ax=axes[0], s=1, hue='Side', palette='flare')
axes[0].set_title("Scatterplot: Side vs Num")
sns.stripplot(data=cabin_df, x="Deck", y="Num", ax=axes[1], s=1, hue='Deck', palette='flare')
axes[1].set_title("Scatterplot: Deck vs Num")
sns.stripplot(data=cabin_df, x="Deck", y="Side", ax=axes[2], s=1, hue='Deck', palette='flare')
axes[2].set_title("Scatterplot: Deck vs Side")
plt.show()
```

From this plot we notice that the decks are not evenly populated: the most populated is 'F', while 'T' has only a handful of passengers. As a consequence there are more cabins with a low number, since low numbers are present in every deck, while high numbers are reached only in the more populated decks. The side is instead quite evenly distributed.
We now merge the 3 new features into the dataset.
```python
# drop Cabin
if 'Cabin' in data.keys():
    data = data.drop(['Cabin'], axis=1)
# merge the two dataframes
data = pd.concat([data, cabin_df], axis=1)
data.head()
```
|   | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Group | Member | Deck | Num | Side |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | False | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | B | 0.0 | P |
| 1 | Earth | False | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | 2 | 1 | F | 0.0 | S |
| 2 | Europa | False | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | 3 | 1 | A | 0.0 | S |
| 3 | Europa | False | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | 3 | 2 | A | 0.0 | S |
| 4 | Earth | False | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | 4 | 1 | F | 1.0 | S |
Having explored the dataset, we will now concentrate on preprocessing. We start by analyzing the missing values present in our dataset and finding the smartest possible way to deal with them, examining the relations between the features, including those between the newly built features and the others.
```python
data.isna().any()
```
```
HomePlanet       True
CryoSleep        True
Destination      True
Age              True
VIP              True
RoomService      True
FoodCourt        True
ShoppingMall     True
Spa              True
VRDeck           True
Group           False
Member          False
Deck             True
Num              True
Side             True
dtype: bool
```
We see that all the features have missing values apart from 'Group' and 'Member', which in the original dataset were part of 'PassengerId'. Now let's count the missing values for every column.
```python
# Columns with missing values
na_cols = data.columns[data.isna().any()].tolist()
# Missing values summary
mv = pd.DataFrame(data[na_cols].isna().sum(), columns=['Number_missing'])
mv['Percentage_missing'] = np.round(100*mv['Number_missing']/len(data), 2)
mv
```
|   | Number_missing | Percentage_missing |
|---|---|---|
| HomePlanet | 201 | 2.31 |
| CryoSleep | 217 | 2.50 |
| Destination | 182 | 2.09 |
| Age | 179 | 2.06 |
| VIP | 203 | 2.34 |
| RoomService | 181 | 2.08 |
| FoodCourt | 183 | 2.11 |
| ShoppingMall | 208 | 2.39 |
| Spa | 183 | 2.11 |
| VRDeck | 188 | 2.16 |
| Deck | 199 | 2.29 |
| Num | 199 | 2.29 |
| Side | 199 | 2.29 |
Now we count how many missing values occur in the same rows, in order to decide whether we can drop entire rows or have to fill the missing values.
```python
# Countplot of number of missing values by passenger
data['na_count'] = data.isna().sum(axis=1)
plt.figure(figsize=(7,3))
sns.countplot(data=data, x='na_count', palette='flare')
plt.title('Number of missing entries by passenger')
count_miss = Counter(data['na_count'])
print(count_miss)
```
```
Counter({0: 6764, 1: 1587, 3: 167, 2: 135, 4: 36, 5: 4})
```
The missing values number around 200 for every feature, but they are not concentrated in the same entries. In fact, as we can see from the histogram, most of the passengers with missing values have only one, while passengers with 2, 3, 4 and 5 are less and less probable. As a consequence, around 2000 entries have at least one missing value, and eliminating all of them would make us lose a significant part of the dataset. At the same time, a passenger has 13 features, and since, excluding the very few passengers with 4 or 5 missing values, the maximum number of missing features for a single passenger is 3, the percentage of missing features per passenger is relatively low. A good way to deal with the problem therefore seems to be eliminating the few passengers with 4 or 5 missing features and filling all the other missing values in the most efficient way.
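As a side note, the "drop rows with 4 or more missing values" step described above can also be expressed in a vectorized form with a boolean mask. This is a hedged sketch on a toy frame (the column names and values are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, np.nan, np.nan, np.nan],
                    'b': [2.0, np.nan, np.nan, np.nan, 1.0],
                    'c': [3.0, np.nan, np.nan, 2.0, 1.0],
                    'd': [4.0, np.nan, 1.0, 2.0, 1.0]})

# keep only rows with fewer than 4 missing values
kept = toy[toy.isna().sum(axis=1) < 4].reset_index(drop=True)
print(len(kept))  # → 4 (only the all-NaN-but-nothing row with 4 NaNs is dropped)
```

The same mask could be applied to the label series as well, so that data and labels stay aligned.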
```python
if 'na_count' in data.keys():
    data.drop('na_count', axis=1, inplace=True)

print("Before drop: ", len(data))

# drop rows with 4 or 5 missing values
index = []
for i in range(len(data)):
    if data.iloc[i].isna().sum() >= 4:
        index.append(i)
data.drop(index, inplace=True)
data.reset_index(drop=True, inplace=True)

# drop also from label
label.drop(index, inplace=True)
label.reset_index(drop=True, inplace=True)

print("Dropped: ", len(index))
print("After drop: ", len(data))
```
```
Before drop:  8693
Dropped:  40
After drop:  8653
```
Let's check whether the new columns we introduced by separating 'PassengerId' and 'Cabin' bring to light correlations between features.
```python
# correlation matrix with the new features
num_features = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
                'Group', 'Member', 'Num']
corrmat = data[num_features].corr()
f, ax = plt.subplots(figsize=(7, 5))
sns.heatmap(corrmat, vmax=.8, square=True);
```

We indeed notice a strong correlation between 'Group' and 'Num'. Let's try to visualize it.
```python
plt.figure(figsize=(7,4))
plt.xlabel('Group Number')
plt.ylabel('Cabin Number')
plt.scatter(data['Group'], data['Num'], color='mediumpurple', s=1)
```
We clearly see that as the group number increases, the cabin number increases too, but with different coefficients. From the plot of the cabin number per deck we saw how, for each deck, the cabin number went from 0 up to a certain value; decks were differently populated and so had different maximum cabin numbers. We therefore strongly suspect that the different slopes here represent the different decks. Let's then plot the points with different colors depending on the deck.
```python
size_plt = []
for s in data['Side']:
    if s == 'S':
        size_plt.append(0.5)
    if s == 'P':
        size_plt.append(0.1)
    if type(s) == float:  # NaN
        size_plt.append(0)

import plotly.express as px
fig = px.scatter(data, x="Group", y="Num", color="Deck", trendline="ols",
                 template="simple_white", size=size_plt)
fig.show()
```

This is a remarkable plot that confirms our thoughts. We then foresee a strategy to fill the missing cabin values. When an entry has a missing cabin value (so it misses 'Deck', 'Num' and 'Side'), we check its group number. We know that passengers in the same group are not necessarily put in the same cabin, but they are certainly close, so most likely on the same deck and side. We therefore look at the other members of the group and take the most common deck among them to fill 'Deck', and the most common side to fill 'Side'. Once we have the 'Deck', we have the specific slope of the line that relates 'Group' and 'Num', so we can fill 'Num' with the closest integer to y = mx + b, where x is 'Group' and m and b are derived from the specific deck. If the passenger is alone in the group, we fill the deck with 'F', which is by far the most common, and choose the side randomly between 'S' and 'P', since they are evenly distributed; we then apply the same logic to fill 'Num'.
For this process it is useful to introduce a feature that tells us directly whether a member is alone in a group or not.
```python
# add data['Alone'] feature of all 0
alone = np.zeros(len(data['Group']))

# create feature "Alone" to check if the passenger is alone or not
id_ = 0
for g in data['Group']:
    count = len(data['Member'][data['Group'] == g])
    if count == 1:
        alone[id_] = 1
    id_ += 1

data["Alone"] = alone
print("Alone passengers", data["Alone"].sum())
print("Percentage: ", np.round(100*data["Alone"].sum()/len(data), 2), "%")
```
```
Alone passengers 4802.0
Percentage:  55.5 %
```
Let's now consider the different lines for the different decks and perform a linear fit to obtain the slope and intercept parameters for each one of them, storing the parameters in a dictionary.
```python
fit_params = {}
for i, lett in enumerate('ABCDEFGT'):
    x = data['Group'][data['Deck'] == lett]
    y = data['Num'][data['Deck'] == lett]
    m, b = np.polyfit(x, y, 1)
    # store the parameters
    fit_params[lett] = [m, b]
print(fit_params)
```
```
{'A': [0.01134543281646721, -2.278053282782598], 'B': [0.03583381004406412, 0.6269908711563403], 'C': [0.03497349139023989, 1.1093915196167643], 'D': [0.030138523991481946, 5.557242557622643], 'E': [0.06491654803222263, -0.9653922677020335], 'F': [0.1993934484531678, -3.1269723777144995], 'G': [0.1626124078906598, -4.798066203007098], 'T': [0.0005541280507195021, 0.02029175300884319]}
```
```python
print(data['Num'][data['Deck'] == 'T'])
```
```
996     0.0
2237    1.0
2717    2.0
2745    3.0
4542    2.0
Name: Num, dtype: float64
```
We see that there are only 5 passengers in deck 'T', which makes that fitted line completely imprecise. But, as we see in the cell below, the problem of how to fill the 'Num' of a passenger assigned to deck 'T' never even comes up: no passenger with a missing cabin value belongs to a group in which the majority of the other passengers have 'T' as their deck. In the loop below we added a variable that counts the 'T' deck values added, and it is indeed 0.
```python
# search for a nan in the cabin features and fill it with the value of the line
from collections import Counter

count_ts = 0
for i in range(len(data['Deck'])):
    if pd.isna(data['Deck'][i]):
        g = data['Group'][i]
        if data['Alone'][i] == 0.0:
            # if not alone then check the deck and side of other people in the group
            deck_nonan = []
            side_nonan = []
            num_nonan = []
            for deck_all in data['Deck'][data['Group'] == g]:
                # take away the nans in the count
                if type(deck_all) != float:
                    deck_nonan.append(deck_all)
            for side_all in data['Side'][data['Group'] == g]:
                if type(side_all) != float:
                    side_nonan.append(side_all)
            for num_all in data['Num'][data['Group'] == g]:
                if type(num_all) != float:
                    num_nonan.append(num_all)
            if side_nonan == [] or deck_nonan == [] or num_nonan == []:
                # if the other members of the group all have nan we fill with 'F',
                # since we can't guess the deck; the side can't be guessed either,
                # so we choose it randomly
                deck = 'F'
                side = 'S' if np.random.rand() >= 0.5 else 'P'
            else:
                counter_deck = Counter(deck_nonan)
                counter_side = Counter(side_nonan)
                deck, count_deck = counter_deck.most_common(1)[0]  # most common value
                side, count_side = counter_side.most_common(1)[0]
                if deck == 'T':
                    count_ts += 1
            data.loc[i, 'Deck'] = deck
            data.loc[i, 'Side'] = side
        else:
            deck = 'F'
            choice = np.random.rand()
            if choice >= 0.5:
                side = 'S'
            else:
                side = 'P'
            data.loc[i, 'Deck'] = deck
            data.loc[i, 'Side'] = side
        m = fit_params[deck][0]
        b = fit_params[deck][1]
        num_fit = int(m*g + b)
        data.loc[i, 'Num'] = num_fit

print('T\'s added : {}'.format(count_ts))
print('Missing values of Deck, Num, Side: {} {} {}'.format(
    data['Deck'].isna().sum(), data['Num'].isna().sum(), data['Side'].isna().sum()))
```
```
T's added : 0
Missing values of Deck, Num, Side: 0 0 0
```
Since 'Group' has no missing values, we have filled all the missing cabin slots.
Now let's visualize the combined plots, including the categorical variables, to look for other possible relations. We will only look at the combined distributions of the features that we know have no missing values.
```python
data.keys()
```

```
Index(['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'RoomService',
       'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Group', 'Member', 'Deck',
       'Num', 'Side', 'Alone'],
      dtype='object')
```

```python
# plot ["Deck", "Num", "Side", "Group", "Member"] vs
# ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "Alone", "HomePlanet"]
fig, ax = plt.subplots(5, 8, figsize=(25, 20))
for i, feat in enumerate(["Deck", "Num", "Side", "Group", "Member"]):
    for j, f in enumerate(["Age", "RoomService", "FoodCourt", "ShoppingMall",
                           "Spa", "VRDeck", "Alone", "HomePlanet"]):
        sns.scatterplot(data=data, x=f, y=feat, ax=ax[i, j], s=5, color='mediumpurple')
        ax[i, j].set_xlabel(f)
        ax[i, j].set_ylabel(feat)
        # highlight the two interesting pairs in cyan
        if (feat == 'Member' and f == 'Age') or (feat == 'Num' and f == 'HomePlanet'):
            sns.scatterplot(data=data, x=f, y=feat, ax=ax[i, j], s=5, color='cyan')
plt.show()
```

We notice a relation between age and group member (cyan in the graph above). This could be because a group is usually made up of family members: if the parents are identified as members 1 and 2 and the children as the following numbers, we would expect a descending trend of the age as 'Member' increases.
```python
mean_age = []
member = []
for m in np.unique(data['Member']):
    member.append(m)
    mean_age.append(data['Age'][data['Member'] == m].mean())

plt.figure(figsize=(7, 4))
plt.plot(member, mean_age, 'o-', color='mediumpurple')
# plot the mean age of the entire dataset as a reference line
plt.axhline(y=data['Age'].mean(), color='black', linestyle='--', label='Mean age')
plt.xlabel('Member')
plt.ylabel('Mean Age')
plt.legend()
plt.show()
```

We indeed notice a descending trend until member number 6; after that the mean age starts to increase again. Either way, the average age per member differs considerably from the overall average age, so we fill the missing 'Age' values with the average age by member instead of simply using the overall average.
```python
for i in range(len(data['Age'])):
    if np.isnan(data['Age'][i]):
        data.loc[i, 'Age'] = mean_age[data['Member'][i] - 1]

print('Missing Age values: {}'.format(data['Age'].isna().sum()))
```

```
Missing Age values: 0
```
We also notice a difference in 'HomePlanet' when we take the cabin number into account (also cyan in the graph above).
Let's visualize it better.
```python
# Num vs HomePlanet
# plot the distribution of Num by HomePlanet with a strip plot
plt.figure(figsize=(7, 3))
sns.stripplot(data=data, x="HomePlanet", y="Num", s=1, hue='HomePlanet', palette='flare')
plt.xlabel('HomePlanet')
plt.ylabel('Num')
plt.show()
```
We notice that low cabin numbers seem to be concentrated on Europa, while after a certain point Earth becomes clearly dominant. We then calculate the cabin-number threshold at which the density of Earth passengers surpasses that of Europa.
```python
# Num vs HomePlanet, with the computed threshold drawn on top
plt.figure(figsize=(7, 3))
sns.stripplot(data=data, x="HomePlanet", y="Num", s=1, hue='HomePlanet', palette='flare')
plt.xlabel('HomePlanet')
plt.ylabel('Num')

# slide a window of cabin numbers upward and stop when, inside the window,
# Earth passengers become denser than Europa passengers
threshold = 0
earth_under_thresh = 0
eur_under_thresh = 0
width = 15
for i in range(width, 500):
    count_eur = 0
    count_earth = 0
    dens_eur = 0
    dens_earth = 0
    for n in data['Num'][data['HomePlanet'] == 'Europa']:
        if n < i:
            count_eur += 1
            if n > i - width:
                dens_eur += 1
    for n in data['Num'][data['HomePlanet'] == 'Earth']:
        if n < i:
            count_earth += 1
            if n > i - width:
                dens_earth += 1
    if dens_earth > dens_eur:
        threshold = i
        earth_under_thresh = count_earth
        eur_under_thresh = count_eur
        break

print('Threshold = {}'.format(threshold))
print('Count of Europa\'s under threshold = {}'.format(eur_under_thresh))
print('Count of Earth\'s under threshold = {}'.format(earth_under_thresh))
plt.axhline(y=threshold, color='black', linestyle='--')
plt.show()
```

```
Threshold = 315
Count of Europa's under threshold = 1933
Count of Earth's under threshold = 1040
```
We see that below the threshold Europa has many more values than Earth, so we can fill a missing 'HomePlanet' value with 'Europa' if the cabin number is below the threshold and with 'Earth' if it is above. This is a better guess than filling everything with the mode, which would be 'Earth'.
```python
# fill missing HomePlanet with 'Europa' if the cabin number is below the
# threshold, with 'Earth' otherwise
for i in range(len(data['Num'])):
    # NaN check: the non-missing entries are strings
    if pd.isna(data['HomePlanet'][i]):
        if data['Num'][i] < threshold:
            data.loc[i, 'HomePlanet'] = 'Europa'
        else:
            data.loc[i, 'HomePlanet'] = 'Earth'

print('Missing HomePlanet values: {}'.format(data['HomePlanet'].isna().sum()))
```

```
Missing HomePlanet values: 0
```
Let's take a look at the remaining missing values.
```python
data.isna().sum(axis=0)
```

```
HomePlanet        0
CryoSleep       210
Destination     178
Age               0
VIP             197
RoomService     177
FoodCourt       178
ShoppingMall    206
Spa             181
VRDeck          184
Group             0
Member            0
Deck              0
Num               0
Side              0
Alone             0
dtype: int64
```
Let's focus on the service features.
```python
services = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
print("Mean: ")
print(data[services].mean())
print("------")
print("Median: ")
print(data[services].median())
print("------")
print("Mode: ")
print(data[services].mode())
```

```
Mean: 
RoomService     223.359958
FoodCourt       458.969558
ShoppingMall    173.852610
Spa             310.296152
VRDeck          305.450939
dtype: float64
------
Median: 
RoomService     0.0
FoodCourt       0.0
ShoppingMall    0.0
Spa             0.0
VRDeck          0.0
dtype: float64
------
Mode: 
   RoomService  FoodCourt  ShoppingMall  Spa  VRDeck
0          0.0        0.0           0.0  0.0     0.0
```
For each of these features the mean is of the order of hundreds, while the median and the mode are zero. This is because, as we saw repeatedly in the first chapter, most passengers spend zero on a given service while a few spend a lot. Filling the missing values with the average would therefore contradict the logic of the initial distribution; the more sensible choice is to fill every missing value with 0, which is both the median and the mode for every feature.
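The effect can be seen on a small made-up spending column: a few big spenders pull the mean far away from the typical (zero) value, while the median and mode stay at zero.

```python
import pandas as pd

# hypothetical spending values: most passengers spend nothing
spend = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1200.0, 3500.0])

print(spend.mean())     # 587.5, dominated by the two big spenders
print(spend.median())   # 0.0
print(spend.mode()[0])  # 0.0

# filling NaNs with the median keeps the shape of the distribution
filled = spend.reindex(range(10)).fillna(spend.median())
print(filled.isna().sum())  # 0
```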
```python
data['RoomService'].fillna(data['RoomService'].median(), inplace=True)
data['FoodCourt'].fillna(data['FoodCourt'].median(), inplace=True)
data['ShoppingMall'].fillna(data['ShoppingMall'].median(), inplace=True)
data['Spa'].fillna(data['Spa'].median(), inplace=True)
data['VRDeck'].fillna(data['VRDeck'].median(), inplace=True)
```

We know that passengers in CryoSleep are confined to their cabins, so if a passenger has expenses different from zero in the service features, their CryoSleep value must be False. Since the mode of this feature is also False, as we saw from the plot in Chapter 1, we can fill every remaining missing value with False without committing logical fallacies. Out of curiosity, let's first set to False the values we know with 100% certainty because the passenger spent money, and see how many missing values remain.
```python
for i in range(len(data['CryoSleep'])):
    if pd.isna(data['CryoSleep'][i]):
        for serv in services:
            if data[serv][i] != 0:
                data.loc[i, 'CryoSleep'] = False

# isna().sum() counts the values that are still missing after this pass
print('CryoSleep values still missing: {}'.format(data['CryoSleep'].isna().sum()))
```

```
CryoSleep values still missing: 96
```
```python
data['CryoSleep'].fillna(False, inplace=True)
```

Since we have not found further correlations in the data, the logical choice for the two remaining categorical features, 'Destination' and 'VIP', is to fill them with the most common value (the mode). This is also supported by the fact that for both features, as we saw from the plots in Chapter 1, the mode is significantly more probable than the other values.
```python
print('Mode for Destination: {}'.format(data['Destination'].mode()[0]))
print('Mode for VIP: {}'.format(data['VIP'].mode()[0]))
data['Destination'].fillna('TRAPPIST-1e', inplace=True)
data['VIP'].fillna(False, inplace=True)
```

```
Mode for Destination: TRAPPIST-1e
Mode for VIP: False
```
```python
data.isna().sum(axis=0)
```

```
HomePlanet      0
CryoSleep       0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Group           0
Member          0
Deck            0
Num             0
Side            0
Alone           0
dtype: int64
```
```python
data.keys()
```

```
Index(['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP', 'RoomService',
       'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Group', 'Member', 'Deck',
       'Num', 'Side', 'Alone'],
      dtype='object')
```

We now prepare our complete dataset for the further analysis, which will include dimensionality reduction with PCA and different clustering algorithms. First we must normalize the dataset.
Min-Max Normalization

Min-Max normalization is a simple normalization technique that rescales a feature so its values lie between 0 and 1. The formula for min-max normalization is:

x' = (x - min(x)) / (max(x) - min(x))

where min(x) and max(x) are the minimum and maximum values of the feature.
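As a quick check of the formula (on a made-up column of values), the hand-computed rescaling matches scikit-learn's `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[0.0], [19.0], [38.0], [79.0]])  # hypothetical values

# min-max formula applied by hand: x' = (x - min) / (max - min)
by_hand = (ages - ages.min()) / (ages.max() - ages.min())

scaled = MinMaxScaler().fit_transform(ages)
print(np.allclose(by_hand, scaled))  # True
print(by_hand.min(), by_hand.max())  # 0.0 1.0
```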
```python
# normalize the numerical features using MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data[num_features] = scaler.fit_transform(data[num_features])
data.head()
```

|   | HomePlanet | CryoSleep | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Group | Member | Deck | Num | Side | Alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Europa | False | TRAPPIST-1e | 0.493671 | False | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | B | 0.000000 | P | 1.0 |
| 1 | Earth | False | TRAPPIST-1e | 0.303797 | False | 0.007608 | 0.000302 | 0.001064 | 0.024500 | 0.001823 | 0.000108 | 0.000000 | F | 0.000000 | S | 1.0 |
| 2 | Europa | False | TRAPPIST-1e | 0.734177 | True | 0.003001 | 0.119948 | 0.000000 | 0.299670 | 0.002030 | 0.000216 | 0.000000 | A | 0.000000 | S | 0.0 |
| 3 | Europa | False | TRAPPIST-1e | 0.417722 | False | 0.000000 | 0.043035 | 0.015793 | 0.148563 | 0.007997 | 0.000216 | 0.142857 | A | 0.000000 | S | 0.0 |
| 4 | Earth | False | TRAPPIST-1e | 0.202532 | False | 0.021149 | 0.002348 | 0.006428 | 0.025214 | 0.000083 | 0.000323 | 0.000000 | F | 0.000528 | S | 1.0 |
One-hot encoding is employed to represent categorical data numerically.
Here it is used to convert the categorical variables into binary vectors, where each category is uniquely represented by a binary digit (0 or 1). This ensures that the model can interpret and process categorical information effectively, as each category is isolated and independently encoded in the feature space.
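A minimal example of what this does to a single categorical column (the deck values here are toy data): each category becomes its own indicator column, and each row has exactly one indicator set.

```python
import pandas as pd

# toy column; each category becomes its own binary indicator
df = pd.DataFrame({'Deck': ['B', 'F', 'A']})
encoded = pd.get_dummies(df, columns=['Deck'])

print(encoded.columns.tolist())      # ['Deck_A', 'Deck_B', 'Deck_F']
print(encoded.sum(axis=1).tolist())  # [1, 1, 1] : one indicator per row
```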
```python
# one-hot encode the categorical features
cat_features = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Alone', 'Side']
data_encoded = pd.get_dummies(data, columns=cat_features)
data_encoded.head()
```

|   | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Group | Member | Num | HomePlanet_Earth | ... | Deck_C | Deck_D | Deck_E | Deck_F | Deck_G | Deck_T | Alone_0.0 | Alone_1.0 | Side_P | Side_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.493671 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | False | ... | False | False | False | False | False | False | False | True | True | False |
| 1 | 0.303797 | 0.007608 | 0.000302 | 0.001064 | 0.024500 | 0.001823 | 0.000108 | 0.000000 | 0.000000 | True | ... | False | False | False | True | False | False | False | True | False | True |
| 2 | 0.734177 | 0.003001 | 0.119948 | 0.000000 | 0.299670 | 0.002030 | 0.000216 | 0.000000 | 0.000000 | False | ... | False | False | False | False | False | False | True | False | False | True |
| 3 | 0.417722 | 0.000000 | 0.043035 | 0.015793 | 0.148563 | 0.007997 | 0.000216 | 0.142857 | 0.000000 | False | ... | False | False | False | False | False | False | True | False | False | True |
| 4 | 0.202532 | 0.021149 | 0.002348 | 0.006428 | 0.025214 | 0.000083 | 0.000323 | 0.000000 | 0.000528 | True | ... | False | False | False | True | False | False | False | True | False | True |
5 rows × 31 columns
In this chapter we will perform a PCA in order to reduce dimensionality, and then we will use different clustering algorithms to cluster the data. We will evaluate the results with internal and external indexes and try to interpret them.
We will use the label only to visualize the PCA plot; the clustering algorithms themselves will be evaluated with internal indexes, in an unsupervised setting.
At the end of this chapter we will compare the clustering results against the true labels.
PCA explanation
Dimensionality Reduction
When dealing with datasets featuring numerous columns, the challenge is to visualize the data in a 2D (or 3D) space without creating (X, Y) plots for every pair of columns. Dimensionality reduction techniques help by condensing the excess dimensions into a chosen, reduced representation.
PCA (Principal Component Analysis)
PCA is a key technique for this. It identifies a new dataset representation capturing a fraction of the original variance, albeit with some loss. PCA finds directions where the dataset varies the most—designated as Principal Components (PCs). PC1 is the direction with the most variance, PC2 follows, and so on. These become features in our reduced space, also seen as optimal linear combinations of original features.
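The idea can be sketched on synthetic 2D data (invented for this example): two strongly correlated features collapse almost entirely onto PC1, so one component carries nearly all the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# second feature is mostly a copy of the first, plus small noise
X = np.column_stack([x, x + 0.1 * rng.normal(size=500)])

pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)  # PC1 carries almost all of the variance
```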
```python
from sklearn.decomposition import PCA

pca = PCA(random_state=72)
pca.fit(data_encoded)
x_pca = pca.transform(data_encoded)
x_pca = pd.DataFrame(x_pca)
x_pca[:5]
```

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.077066 | -0.710779 | 0.560092 | 0.854842 | -0.272298 | -1.004628 | -0.314941 | 0.352205 | 0.078219 | -0.227002 | ... | -0.041407 | -0.000976 | -0.018124 | -3.080153e-16 | -1.655609e-16 | -3.201284e-17 | -9.403843e-17 | -7.054026e-17 | 3.176937e-17 | 2.179769e-16 |
| 1 | -0.876438 | -0.334331 | -0.916429 | 0.055874 | 0.054850 | -0.009472 | -0.380416 | 0.753599 | -0.111264 | -0.113808 | ... | -0.008212 | -0.000717 | -0.008460 | -7.209287e-17 | 7.911453e-18 | -4.257268e-18 | -5.240507e-17 | -1.815626e-16 | -7.925293e-17 | 2.110380e-16 |
| 2 | 1.090366 | -0.653358 | -0.704001 | -0.215156 | -0.398348 | -0.705568 | -0.096031 | 0.315992 | 0.238497 | -0.342152 | ... | -0.024878 | -0.004481 | -0.000632 | 3.892943e-17 | 1.848532e-16 | -1.985463e-16 | 1.557618e-16 | -4.278469e-17 | 1.427917e-16 | 6.065549e-16 |
| 3 | 1.055169 | -0.583633 | -0.697894 | -0.298656 | -0.362532 | -0.657613 | -0.126458 | 0.380814 | 0.138526 | -0.241338 | ... | -0.019613 | -0.005746 | 0.009946 | 3.892943e-17 | -2.496887e-18 | -2.124241e-16 | -2.328163e-16 | 1.272646e-17 | 8.728052e-17 | 3.428769e-16 |
| 4 | -0.877580 | -0.330454 | -0.915339 | 0.047740 | 0.055188 | -0.004906 | -0.377789 | 0.753878 | -0.116294 | -0.112944 | ... | 0.003702 | -0.001580 | -0.002424 | -1.658172e-17 | 3.566703e-17 | -4.257268e-18 | -5.240507e-17 | -1.538070e-16 | -7.925293e-17 | 1.971602e-16 |
5 rows × 31 columns
```python
# explained variance from the eigen-decomposition of the covariance matrix
from sklearn import preprocessing

preprocessor = preprocessing.StandardScaler()
cov_mat = np.cov(preprocessor.fit_transform(data_encoded).T)
# eigh: the covariance matrix is symmetric, so the eigenvalues are real
# (np.linalg.eig would emit ComplexWarnings from numerical noise)
eigen_vals, eigen_vecs = np.linalg.eigh(cov_mat)

tot = sum(eigen_vals)
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

plt.bar(data_encoded.keys(), var_exp, alpha=0.3, align='center',
        label='Individual explained variance', color='#00f830')
plt.xticks(rotation=90)
plt.step(range(1, len(var_exp) + 1), cum_var_exp, where='mid',
         label='Cumulative explained variance', color='mediumpurple',
         linestyle='--', marker='o')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
```
From the cumulative explained variance plot we see that the last 10 components contribute almost nothing to the total information, so we drop them.
```python
# remove the last 10 components from x_pca
if len(x_pca.columns) > 10:
    x_pca = x_pca.iloc[:, :-10]
x_pca.head()
```

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.077066 | -0.710779 | 0.560092 | 0.854842 | -0.272298 | -1.004628 | -0.314941 | 0.352205 | 0.078219 | -0.227002 | ... | -0.294517 | -0.170950 | -0.066587 | 0.002210 | 0.046139 | 0.012872 | 0.191104 | -0.078250 | -0.004003 | 0.001536 |
| 1 | -0.876438 | -0.334331 | -0.916429 | 0.055874 | 0.054850 | -0.009472 | -0.380416 | 0.753599 | -0.111264 | -0.113808 | ... | 0.067034 | -0.000042 | -0.029945 | -0.084908 | -0.052148 | 0.024825 | -0.100397 | 0.006014 | 0.016737 | 0.000888 |
| 2 | 1.090366 | -0.653358 | -0.704001 | -0.215156 | -0.398348 | -0.705568 | -0.096031 | 0.315992 | 0.238497 | -0.342152 | ... | 0.185921 | 1.621649 | 0.281813 | 0.190354 | -0.108205 | -0.062806 | 0.288322 | 0.156220 | 0.211082 | -0.041873 |
| 3 | 1.055169 | -0.583633 | -0.697894 | -0.298656 | -0.362532 | -0.657613 | -0.126458 | 0.380814 | 0.138526 | -0.241338 | ... | 0.105992 | 0.352540 | 0.922593 | 0.007845 | -0.125879 | 0.073307 | 0.281807 | 0.039786 | 0.107703 | -0.003873 |
| 4 | -0.877580 | -0.330454 | -0.915339 | 0.047740 | 0.055188 | -0.004906 | -0.377789 | 0.753878 | -0.116294 | -0.112944 | ... | 0.062707 | -0.007881 | -0.025717 | -0.184547 | -0.056260 | 0.018823 | -0.099141 | 0.005534 | 0.018975 | -0.007617 |
5 rows × 21 columns
The plot above shows the explained variance ratio for every principal component. The cumulative variance curve indicates that over 80% of the variance in the data is explained by the first 20 principal components, so we keep all of them for an effective clustering process.
While the clustering algorithms are applied on all the retained components to ensure a comprehensive analysis, practical visualization constraints force us to plot only the first 3 components, since data in more than three dimensions is hard to interpret visually.
Keep in mind that, although the plots are limited to the first 3 components, the clustering outcomes are based on the entirety of the non-dropped components. This strategy ensures a robust clustering analysis while addressing the challenges of visualizing high-dimensional data.
Now we will consider just the first 3 dimensions of the PCA and plot the data in a 3D space.
Let's see the explained variance ratio of the first 3 components.
```python
# total variance explained by the first 3 components
# (note: slicing must be [:3], not [:2], to actually include three components)
total_var = np.sum(pca.explained_variance_ratio_[:3])
print('Total variance explained by the first 3 components: {}'.format(total_var))
```
Plot the first 3 components and use the label 'Transported' to color the points.
```python
fig = px.scatter_3d(
    x_pca, x=0, y=1, z=2,
    color=label,
    size=0.1 * np.ones(len(data_encoded)),
    opacity=1, size_max=5,
    title=f'Total Explained Variance: {100 * total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2'},
    width=800, height=500)
fig.show()
```

This is what we will try to obtain with the clustering algorithms.
Let's now start with the clustering algorithms. We will use the following:
- DBSCAN: Density-Based Spatial Clustering of Applications with Noise, a density-based algorithm.
- K-Means: a partitioning algorithm that aims to split the n observations into k clusters.
- Agglomerative Clustering: a hierarchical algorithm that iteratively merges the closest clusters.
- Gaussian Mixture: a model-based algorithm that assumes the data is generated by a mixture of Gaussian distributions.
DBSCAN is a density-based clustering algorithm. It groups together data points that are close to each other and have a sufficient number of neighboring points within a specified distance. It can identify clusters of arbitrary shape and is robust to outliers, classifying points as core, border, or noise. It takes two parameters.
Parameters:
1) Epsilon (eps): Radius defining the neighborhood around a data point.
2) MinPoints: Minimum number of data points required to form a dense region (core point).
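The roles of the two parameters can be seen in a toy run (the points below are made up): points in regions with at least `min_samples` neighbors within `eps` form clusters, while isolated points get the noise label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two tight groups of three points each, plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [10.0, 10.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # [ 0  0  0  1  1  1 -1] : two clusters and one noise point
```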
To determine the two parameters we can loop over a reasonable range of both, apply DBSCAN, and compute the goodness of the clusterization through the silhouette score for every combination.
```python
# compute the silhouette score as a function of eps and min_samples
from sklearn.cluster import DBSCAN
from sklearn import metrics

eps_l = np.linspace(0.1, 1, 5)
min_samples_l = np.linspace(1, 100, 5)
eps, min_samples = np.meshgrid(eps_l, min_samples_l)

silhouette = []
for i in range(len(eps)):
    for j in range(len(min_samples)):
        db = DBSCAN(eps=eps[i][j], min_samples=int(min_samples[i][j])).fit(x_pca)
        if len(set(db.labels_)) == 1:
            # the silhouette score is undefined for a single cluster
            silhouette.append(0)
            continue
        silhouette.append(metrics.silhouette_score(x_pca, db.labels_))

# reshape the scores to match the meshgrid
silhouette_matrix = np.array(silhouette).reshape(eps.shape)

plt.figure(figsize=(8, 6))
plt.imshow(silhouette_matrix, cmap='viridis', origin='lower', extent=(0, 100, 0, 100))
plt.xticks(eps_l * 100)
plt.yticks(min_samples_l)
plt.colorbar(label='Silhouette Score')
plt.title('Silhouette Score for DBSCAN Parameters')
plt.xlabel('100*Epsilon')
plt.ylabel('Min Samples')
plt.show()
```

So we can now try running it with eps=0.7, min_samples=5.
```python
# cluster the PCA data with DBSCAN
db = DBSCAN(eps=0.7, min_samples=5).fit(x_pca)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels_DBS = db.labels_

# number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(labels_DBS)) - (1 if -1 in labels_DBS else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(x_pca, labels_DBS))

# plot the result with plotly
import plotly.express as px
fig = px.scatter_3d(
    x_pca, x=0, y=1, z=2,
    color=labels_DBS,
    size=0.1 * np.ones(len(data_encoded)),
    opacity=1, size_max=5,
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2'},
    width=800, height=500)
fig.show()
```

Visually, the data is composed of small, dense, scattered regions; this is why DBSCAN needs a very large number of clusters to achieve a decent silhouette score. Maximizing the silhouette score therefore leads to an unsatisfying result: 176 clusters is definitely too high for our purposes. Conversely, because of the shape of the data, running the algorithm with a larger min_samples and a lower epsilon yields a low number of clusters, but with one of them covering almost the whole dataset and the others formed by small regions of points.
Let's run it with eps=0.1, min_samples=35.
```python
db2 = DBSCAN(eps=0.1, min_samples=35).fit(x_pca)
core_samples_mask2 = np.zeros_like(db2.labels_, dtype=bool)
core_samples_mask2[db2.core_sample_indices_] = True
labels_DBS2 = db2.labels_

# number of clusters in labels, ignoring noise if present
n_clusters_2 = len(set(labels_DBS2)) - (1 if -1 in labels_DBS2 else 0)
print('Estimated number of clusters: %d' % n_clusters_2)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(x_pca, labels_DBS2))

fig = px.scatter_3d(
    x_pca, x=0, y=1, z=2,
    color=labels_DBS2,
    size=0.1 * np.ones(len(data_encoded)),
    opacity=1, size_max=5,
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2'},
    width=800, height=500)
fig.show()
```

We can conclude that DBSCAN's sensitivity to outliers and small high-density regions makes it a poor fit for our specific dataset.
Let's now try other clustering methods. For the following algorithms we will use three different internal indexes, defined below, to check the best number of clusters: the Silhouette score, the Calinski-Harabasz index, and the Davies-Bouldin index.
Interpreting the results
Silhouette Score:
- The silhouette score measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1.
- Interpretation:
- A score near +1 indicates good matching within the cluster and poor matching to neighboring clusters.
- A score around 0 suggests overlapping clusters.
- A score less than 0 may indicate potential misclassification.
Davies-Bouldin Index:
- The Davies-Bouldin Index measures the average similarity index of each cluster with the cluster that is most similar to it.
- Interpretation:
- Lower values (closer to 0) indicate better clustering.
- Clusters should be well-separated and compact.
Calinski-Harabasz
- The Calinski-Harabasz index is the ratio of between-cluster variance to within-cluster variance. It ranges from zero to infinity.
- Interpretation:
- Higher values indicate better clustering.
K-Means is a partitioning-based clustering algorithm that aims to split the data points into k clusters. It minimizes the sum of squared distances between data points and the centroid of their assigned cluster. It is widely used, but it assumes spherical clusters and requires specifying the number of clusters (k) in advance.
We start by computing the different scores as a function of the number of clusters and plotting them.
```python
from sklearn.cluster import KMeans

k = np.linspace(2, 200, 30).astype(int)
calinski = []
silhouette = []
davies = []
for i in range(len(k)):
    km = KMeans(n_clusters=k[i], random_state=0, n_init='auto').fit(x_pca)
    labels = km.labels_
    calinski.append(metrics.calinski_harabasz_score(x_pca, labels))
    silhouette.append(metrics.silhouette_score(x_pca, labels))
    davies.append(metrics.davies_bouldin_score(x_pca, labels))

plt.figure(figsize=(16, 5))
plt.subplot(1, 3, 1)
plt.plot(k, calinski, color='mediumpurple')
plt.title('Calinski-Harabasz Score for KMeans')
plt.xlabel('k')
plt.ylabel('Calinski-Harabasz Score')
plt.subplot(1, 3, 2)
plt.plot(k, silhouette, color='mediumpurple')
plt.title('Silhouette Score for KMeans')
plt.xlabel('k')
plt.ylabel('Silhouette Score')
plt.subplot(1, 3, 3)
plt.plot(k, davies, color='mediumpurple')
plt.title('Davies-Bouldin Score for KMeans')
plt.xlabel('k')
plt.ylabel('Davies-Bouldin Score')
plt.show()
```

The indexes give apparently contradictory answers: this could be because K-Means assumes roughly spherical clusters, while our data is sparser and high-dimensional. The silhouette index reaches significant values only for a very high number of clusters, which would give a result similar to DBSCAN's. The Calinski-Harabasz score, instead, improves as the number of clusters decreases. We can try clustering with k=2 and show the results.
# #############################################################################
# Compute KMeans with k=2
km = KMeans(n_clusters=2, random_state=0, n_init='auto').fit(x_pca)
labels_KMN = km.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels_KMN)) - (1 if -1 in labels_KMN else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(x_pca, labels_KMN))
print("Calinski-Harabasz score: %0.3f" % metrics.calinski_harabasz_score(x_pca, labels_KMN))

# #############################################################################
# Plot result using plotly
import plotly.express as px
fig = px.scatter_3d(
    x_pca, x=0, y=1, z=2,
    color=labels_KMN,
    size=0.1*np.ones(len(data_encoded)),
    opacity=1, size_max=5,
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2'},
    width=800, height=500
)
fig.show()

We indeed notice that the separation between the clusters is nearly nonexistent, which is why the silhouette score is close to zero. However, the structure of the two clusters more closely resembles the actual labels.
GMM is a probabilistic, model-based clustering algorithm. It assumes that the data points are generated from a mixture of several Gaussian distributions. GMM assigns to each data point a probability of belonging to each cluster, which allows for flexible cluster shapes. It provides soft assignments, indicating the likelihood of a point belonging to each cluster. Its parameter is the number of components: the predetermined number of Gaussian distributions.
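The soft assignments mentioned above are exposed by `predict_proba`; a self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gm = GaussianMixture(n_components=3, random_state=0).fit(X)

hard = gm.predict(X)        # hard label: the most likely component
soft = gm.predict_proba(X)  # soft label: P(component | point), one row per point

# Each row of the soft assignment sums to 1 (a distribution over components)
print(soft[0], soft[0].sum())
```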
Let's use internal indexes to determine the best number of components.
from sklearn.mixture import GaussianMixture
from sklearn import metrics

n_components = np.linspace(2, 100, 9).astype(int)
calinski = []
silhouette = []
davies = []
for i in range(len(n_components)):
    gm = GaussianMixture(n_components=int(n_components[i]), random_state=0).fit(x_pca)
    labels = gm.predict(x_pca)
    calinski.append(metrics.calinski_harabasz_score(x_pca, labels))
    silhouette.append(metrics.silhouette_score(x_pca, labels))
    davies.append(metrics.davies_bouldin_score(x_pca, labels))
calinski = np.array(calinski)
silhouette = np.array(silhouette)
davies = np.array(davies)

# Plot the three internal indexes against the number of components
plt.figure(figsize=(16, 5))
plt.subplot(1, 3, 1)
plt.plot(n_components, calinski, color='mediumpurple')
plt.title('Calinski-Harabasz Score for GaussianMixture')
plt.xlabel('n_components')
plt.ylabel('Calinski-Harabasz Score')
plt.subplot(1, 3, 2)
plt.plot(n_components, silhouette, color='mediumpurple')
plt.title('Silhouette Score for GaussianMixture')
plt.xlabel('n_components')
plt.ylabel('Silhouette Score')
plt.subplot(1, 3, 3)
plt.plot(n_components, davies, color='mediumpurple')
plt.title('Davies-Bouldin Score for GaussianMixture')
plt.xlabel('n_components')
plt.ylabel('Davies-Bouldin Score')
plt.show()

Let's cluster and plot using 6 components, which seems a reasonable value across all indexing methods.
# #############################################################################
# Compute GaussianMixture with 6 components on the PCA projection
gm = GaussianMixture(n_components=6, random_state=0).fit(x_pca)
labels_GSS = gm.predict(x_pca)

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels_GSS)) - (1 if -1 in labels_GSS else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(x_pca, labels_GSS))

# #############################################################################
# Plot result using plotly
import plotly.express as px
fig = px.scatter_3d(
    x_pca, x=0, y=1, z=2,
    color=labels_GSS,
    size=0.1*np.ones(len(data_encoded)),
    opacity=1, size_max=5,
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2'},
    width=800, height=500
)
fig.show()

Once again, we obtain a disappointingly low silhouette score.
Agglomerative clustering is a type of hierarchical clustering that builds clusters bottom-up (as opposed to divisive, top-down approaches). Each data point starts as a separate cluster, and the algorithm merges the closest clusters iteratively. It needs a linkage method, which determines how to measure the distance between clusters during merging, and the number of clusters.
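To illustrate the role of the linkage method, a small sketch (synthetic data, not ours) comparing two linkages on the same points:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# 'ward' merges the pair of clusters that least increases within-cluster variance;
# 'single' merges by minimum pairwise distance (prone to chaining)
for linkage in ("ward", "single"):
    ac = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit(X)
    print(linkage, np.bincount(ac.labels_))  # cluster sizes under each linkage
```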
Let's compute and plot the indexes once again.
# Compute the internal indexes for hierarchical clustering
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics

n_clusters = np.linspace(2, 50, 9).astype(int)
silhouette = []
calinski = []
davies = []
for i in range(len(n_clusters)):
    ac = AgglomerativeClustering(n_clusters=int(n_clusters[i])).fit(x_pca)
    labels = ac.labels_
    silhouette.append(metrics.silhouette_score(x_pca, labels))
    calinski.append(metrics.calinski_harabasz_score(x_pca, labels))
    davies.append(metrics.davies_bouldin_score(x_pca, labels))
silhouette = np.array(silhouette)
calinski = np.array(calinski)
davies = np.array(davies)

# Plot the three internal indexes against the number of clusters
plt.figure(figsize=(16, 5))
plt.subplot(1, 3, 1)
plt.plot(n_clusters, calinski, color='mediumpurple')
plt.xlabel('n_clusters')
plt.ylabel('Calinski-Harabasz Score')
plt.subplot(1, 3, 2)
plt.plot(n_clusters, silhouette, color='mediumpurple')
plt.xlabel('n_clusters')
plt.ylabel('Silhouette Score')
plt.subplot(1, 3, 3)
plt.plot(n_clusters, davies, color='mediumpurple')
plt.xlabel('n_clusters')
plt.ylabel('Davies-Bouldin Score')
plt.show()

We get the same qualitative result as with K-Means.
# Hierarchical clustering on the PCA projection
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics

# #############################################################################
# Compute AgglomerativeClustering with 3 clusters
ac = AgglomerativeClustering(n_clusters=3).fit(x_pca)
labels_AGG = ac.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels_AGG)) - (1 if -1 in labels_AGG else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(x_pca, labels_AGG))

# #############################################################################
# Plot result using plotly
import plotly.express as px
fig = px.scatter_3d(
    x_pca, x=0, y=1, z=2,
    color=labels_AGG,
    size=0.1*np.ones(len(data_encoded)),
    opacity=1, size_max=5,
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2'},
    width=800, height=500
)
fig.show()

Here is the plot of the corresponding dendrogram, which shows the hierarchy of clusters.
# Typical dendrogram plot
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(10, 7))
plt.title("Dendrogram")
dend = shc.dendrogram(shc.linkage(x_pca, method='ward'))

We can conclude that the unsupervised clustering does not bring satisfactory results: to obtain decent internal index scores, the number of clusters has to be on the order of hundreds, while for reasonable cluster numbers the data is badly separated. This is probably due to two reasons:
1) the non-convexity of the clusters and the sparseness of the data; 2) the inherent complexity and non-separability of the dataset.
Nevertheless, we think some meaningful results can be reached by comparing the clustering to the actual labels.
Here we compare our clustering results with the true labels of the dataset, 'Transported' and 'Not Transported', and resort to visual inspection of the clusters.
Earlier we tried the K-Means algorithm with 2 clusters. Comparing the obtained labels to the actual ones and computing the Rand score, we obtain:
# Compare label with labels_KMN using the Adjusted Rand Index
from sklearn.metrics.cluster import adjusted_rand_score

adjusted_rand_score(label, labels_KMN)

The adjusted Rand index measures the similarity of two assignments, ignoring permutations and with chance normalization. Since we got 0.01, the two assignments are not similar at all, so the clustering algorithm did not perform well. However, let's inspect the plot of the true labels.
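As a quick aside, the two properties mentioned, permutation invariance and chance normalization, can be checked directly on toy labellings:

```python
import numpy as np
from sklearn.metrics.cluster import adjusted_rand_score

true = [0, 0, 1, 1, 2, 2]

# The same partition up to a renaming of the clusters: ARI is exactly 1
print(adjusted_rand_score(true, [1, 1, 2, 2, 0, 0]))

# Random labellings score around 0 on average (chance normalization)
rng = np.random.default_rng(0)
scores = [adjusted_rand_score(true, rng.integers(0, 3, size=6)) for _ in range(1000)]
print(np.mean(scores))
```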
fig = px.scatter_3d(
    x_pca, x=0, y=1, z=2,
    color=label,
    size=0.1*np.ones(len(data_encoded)),
    opacity=1, size_max=5,
    title=f'Total Explained Variance: {100*total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2'},
    width=800, height=500
)
fig.show()

We notice that True and False do not completely overlap: there is a region towards high values of the first component with significantly more True, another towards low values of both components with much more False, and a central overlapping region. We can check whether clustering with 3 clusters (which produces a low silhouette score but a high Calinski-Harabasz score) yields something similar to what we deduced from visual inspection.
Let's then run agglomerative clustering with 3 clusters.
# #############################################################################
# Compute AgglomerativeClustering with 3 clusters
ac = AgglomerativeClustering(n_clusters=3).fit(x_pca)
labels_AGG = ac.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels_AGG)) - (1 if -1 in labels_AGG else 0)
print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(x_pca, labels_AGG))

# #############################################################################
# Side-by-side 2D projection: true labels vs. cluster labels
# (x_pca is indexed by column: first two principal components)
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sc0 = axes[0].scatter(x_pca[:, 0], x_pca[:, 1], c=label, s=1, cmap='bwr')
plt.colorbar(sc0, ax=axes[0])
axes[0].set_title('True label')
axes[0].set_xlabel('PC 1')
axes[0].set_ylabel('PC 2')
sc1 = axes[1].scatter(x_pca[:, 0], x_pca[:, 1], c=labels_AGG, s=1, cmap='cool')
plt.colorbar(sc1, ax=axes[1])
axes[1].set_title('Agglomerative Clustering label')
axes[1].set_xlabel('PC 1')
axes[1].set_ylabel('PC 2')
plt.show()

We notice from this two-dimensional projection that the three clusters produced resemble the three regions described above. At the price of excluding the points belonging to the central blue cluster, we can interpret the violet cluster as True and the cyan one as False. We can then compute the confusion matrix and the resulting accuracy.
# Compute the confusion matrix between the true labels and the cluster labels
from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(label, labels_AGG)
print(cm)

# Plot the confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True, cmap='Blues', fmt='g')
# plt.xticks(np.arange(0.5, 3.5, 1), ('Violet', 'Pink', 'Cyan'))
# plt.yticks(np.arange(0.5, 3.5, 1), ('NOT Transported', 'Transported', '---'))
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

Indeed, we notice a clear tendency. Excluding the confused cluster, the other two correctly assign about twice as many labels as they miss. The resulting accuracy is then a bit higher than 0.6. This is not a satisfactory result for practical purposes, but it can be considered valuable given the intrinsic complexity of a dataset on which only unsupervised analysis could be conducted.
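A generic way to turn such a confusion matrix into an accuracy is to map each cluster to its majority true class; a sketch on hypothetical toy labels (standing in for `label` and `labels_AGG`, not our data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 2 true classes, 3 cluster ids
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
clusters = np.array([2, 2, 0, 1, 1, 1, 2, 0])

cm = confusion_matrix(y_true, clusters)

# Map each cluster (column) to the true class (row) it mostly contains,
# then count the points that mapping gets right
best_class_per_cluster = cm.argmax(axis=0)
correct = cm[best_class_per_cluster, np.arange(cm.shape[1])].sum()
accuracy = correct / cm.sum()
print(accuracy)  # → 0.875
```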
cm = confusion_matrix(label, labels_AGG)
print("Acc: ", cm[0][0] / (cm[0][0] + cm[0][1] + cm[0][2]))